## datatable function from DT package create an HTML widget display of the dataset
## install DT package if the package is not yet available in your R environment
readxl::read_excel("dataset/dataset-variable-description.xlsx") |>
  DT::datatable()
HR Analytics Employee Attrition and Performance
BCon 147: special topics
1 Project overview
In this project, we will explore employee attrition and performance using the HR Analytics Employee Attrition & Performance dataset. The primary goal is to develop insights into the factors that contribute to employee attrition. By analyzing a range of factors, including demographic data, job satisfaction, work-life balance, and job role, we aim to help businesses identify key areas where they can improve employee retention.
2 Scenario
Imagine you are working as a data analyst for a mid-sized company that is experiencing high employee turnover, especially among high-performing employees. The company has been facing increased costs related to hiring and training new employees, and management is concerned about the negative impact on productivity and morale. The human resources (HR) team has collected historical employee data and now looks to you for actionable insights. They want to understand why employees are leaving and how to retain talent effectively.
Your task is to analyze the dataset and provide insights that will help HR prioritize retention strategies. These strategies could include interventions like revising compensation policies, improving job satisfaction, or focusing on work-life balance initiatives. The success of your analysis could lead to significant cost savings for the company and an increase in employee engagement and performance.
3 Understanding data source
The dataset used for this project provides information about employee demographics, performance metrics, and various satisfaction ratings. The dataset is particularly useful for exploring how factors such as job satisfaction, work-life balance, and training opportunities influence employee performance and attrition.
This dataset is well-suited for conducting in-depth analysis of employee performance and retention, enabling us to build predictive models that identify the key drivers of employee attrition. Additionally, we can assess the impact of various organizational factors, such as training and work-life balance, on both performance and retention outcomes.
4 Data wrangling and management
Libraries
Before we start working on the dataset, we need to load the necessary libraries that will be used for data wrangling, analysis, and visualization. Make sure to load the following libraries here. For packages that still need to be installed, you can use the install.packages function. More packages are needed later in this project, so install them as needed and load them here.
# load all your libraries here
library(dplyr)
library(ggplot2)
library(tidyr)
library(janitor)
library(tidyverse)
library(readxl)
library(lubridate)
library(tidytext)
library(readr)
library(haven)
library(skimr)
library(magrittr)
library(DT)
library(GGally)
library(corrplot)
library(sjPlot)
library(report)
library(ggstatsplot)
4.1 Data importation
Import the two datasets, Employee.csv and PerformanceRating.csv. Save Employee.csv as employee_dta and PerformanceRating.csv as perf_rating_dta.
Merge the two datasets using the left_join function from dplyr. Use the EmployeeID variable as the variable to join by. You may read more about the left_join function in the dplyr documentation.
Save the merged dataset as hr_perf_dta and display the dataset using the datatable function from the DT package.
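Before relying on the merged table, it can help to check how the join changes the row count. The sketch below uses small made-up tables (employee_toy and review_toy are hypothetical, not the project files) to show that a left_join on EmployeeID keeps every employee but repeats an employee once per matching review row. It assumes, as the repeated salary and tenure values printed later in this report suggest, that PerformanceRating.csv can hold several reviews per employee.

```r
library(dplyr)

# Hypothetical toy data: one row per employee, zero or more reviews each
employee_toy <- tibble(
  EmployeeID = c("E1", "E2", "E3"),
  Attrition  = c("No", "Yes", "No")
)
review_toy <- tibble(
  EmployeeID    = c("E1", "E1", "E1", "E2"),
  ManagerRating = c(3, 4, 5, 2)
)

# Same join pattern as in the project: keep all employees, attach reviews
joined_toy <- left_join(employee_toy, review_toy, by = "EmployeeID")

nrow(employee_toy)                 # rows before the join
nrow(joined_toy)                   # rows after the join (one per review, or one NA row)
n_distinct(joined_toy$EmployeeID)  # number of distinct employees is unchanged
```

If the merged data has more rows than distinct employees, any rate computed per row (such as an attrition rate) weights employees by their number of reviews, which is worth keeping in mind when reading the results.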
## import the two data here
## paths are relative to the project folder, matching the dataset/ folder used above
employee_dta <- read.csv("dataset/Employee.csv")
perf_rating_dta <- read.csv("dataset/PerformanceRating.csv")
## merge employee_dta and perf_rating_dta using left_join function.
merged_data <- left_join(employee_dta, perf_rating_dta, by = "EmployeeID")
## save the merged dataset as hr_perf_dta
hr_perf_dta <- merged_data
## Use the datatable from DT package to display the merged dataset
datatable(hr_perf_dta)
4.2 Data management
Using the clean_names function from the janitor package, standardize the variable names by using the recommended naming of variables.
Save the renamed variables as hr_perf_dta to update the dataset.
## clean names using the janitor packages and save as hr_perf_dta
hr_perf_dta <- hr_perf_dta %>% clean_names()
## display the renamed hr_perf_dta using datatable function
datatable(hr_perf_dta)
Create a new variable cat_education wherein education is 1 = No formal education; 2 = High school; 3 = Bachelor; 4 = Masters; 5 = Doctorate. Use the case_when function to accomplish this task.
Similarly, create new variables cat_envi_sat, cat_job_sat, and cat_relation_sat for environment_satisfaction, job_satisfaction, and relationship_satisfaction, respectively. Re-code the values accordingly as 1 = Very dissatisfied; 2 = Dissatisfied; 3 = Neutral; 4 = Satisfied; and 5 = Very satisfied.
Create new variables cat_work_life_balance, cat_self_rating, and cat_manager_rating for work_life_balance, self_rating, and manager_rating, respectively. Re-code accordingly as 1 = Unacceptable; 2 = Needs improvement; 3 = Meets expectation; 4 = Exceeds expectation; and 5 = Above and beyond.
Create a new variable bi_attrition by transforming the attrition variable into a numeric variable. Re-code accordingly as No = 0 and Yes = 1.
Save all the changes in hr_perf_dta. Note that saving the changes with the same name will update the dataset with the new variables created.
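The task asks for case_when, and the solution below uses it. As an aside, the same 1-to-5 recoding can also be sketched with base R's factor(), which avoids repeating the mapping for every variable and keeps the labels in their natural order. The helper name recode_scale and the toy vector are hypothetical and only illustrate the idea.

```r
# Labels for the 1..5 rating scale described in the task
rating_labels <- c("Unacceptable", "Needs improvement", "Meets expectation",
                   "Exceeds expectation", "Above and beyond")

# Hypothetical helper: map integer codes 1..length(labels) to ordered labels.
# Codes outside that range (and NA) become NA, like the TRUE ~ NA_character_ fallback.
recode_scale <- function(x, labels) {
  factor(x, levels = seq_along(labels), labels = labels, ordered = TRUE)
}

# Toy input with a missing value and an out-of-range code
self_rating_toy <- c(3, 5, NA, 1, 7)
cat_self_rating_toy <- recode_scale(self_rating_toy, rating_labels)

as.character(cat_self_rating_toy)
```

One practical difference from case_when is that the result is an ordered factor rather than a character vector, so plots and tables sort the categories from "Unacceptable" to "Above and beyond" instead of alphabetically.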
## create cat_education
colnames(hr_perf_dta)
[1] "employee_id" "first_name"
[3] "last_name" "gender"
[5] "age" "business_travel"
[7] "department" "distance_from_home_km"
[9] "state" "ethnicity"
[11] "education" "education_field"
[13] "job_role" "marital_status"
[15] "salary" "stock_option_level"
[17] "over_time" "hire_date"
[19] "attrition" "years_at_company"
[21] "years_in_most_recent_role" "years_since_last_promotion"
[23] "years_with_curr_manager" "performance_id"
[25] "review_date" "environment_satisfaction"
[27] "job_satisfaction" "relationship_satisfaction"
[29] "training_opportunities_within_year" "training_opportunities_taken"
[31] "work_life_balance" "self_rating"
[33] "manager_rating"
hr_perf_dta <- hr_perf_dta %>% mutate(cat_education = case_when(
education == 1 ~ "No formal education",
education == 2 ~ "High school",
education == 3 ~ "Bachelor",
education == 4 ~ "Masters",
education == 5 ~ "Doctorate",
TRUE ~ NA_character_
))
## create cat_envi_sat, cat_job_sat, and cat_relation_sat
hr_perf_dta <- hr_perf_dta %>% mutate(cat_envi_sat = case_when(
environment_satisfaction == 1 ~ "Very dissatisfied",
environment_satisfaction == 2 ~ "Dissatisfied",
environment_satisfaction == 3 ~ "Neutral",
environment_satisfaction == 4 ~ "Satisfied",
environment_satisfaction == 5 ~ "Very satisfied",
TRUE ~ NA_character_
)) %>%
# Recode job satisfaction
mutate(cat_job_sat = case_when(
job_satisfaction == 1 ~ "Very dissatisfied",
job_satisfaction == 2 ~ "Dissatisfied",
job_satisfaction == 3 ~ "Neutral",
job_satisfaction == 4 ~ "Satisfied",
job_satisfaction == 5 ~ "Very satisfied",
TRUE ~ NA_character_
)) %>%
# Recode relationship satisfaction
mutate(cat_relation_sat = case_when(
relationship_satisfaction == 1 ~ "Very dissatisfied",
relationship_satisfaction == 2 ~ "Dissatisfied",
relationship_satisfaction == 3 ~ "Neutral",
relationship_satisfaction == 4 ~ "Satisfied",
relationship_satisfaction == 5 ~ "Very satisfied",
TRUE ~ NA_character_))
datatable(hr_perf_dta)
## create cat_work_life_balance, cat_self_rating, and cat_manager_rating
hr_perf_dta <- hr_perf_dta %>% mutate(cat_work_life_balance = case_when(
work_life_balance == 1 ~ "Unacceptable",
work_life_balance == 2 ~ "Needs improvement",
work_life_balance == 3 ~ "Meets expectation",
work_life_balance == 4 ~ "Exceeds expectation",
work_life_balance == 5 ~ "Above and beyond",
TRUE ~ NA_character_
)) %>%
# Recode self-rating
mutate(cat_self_rating = case_when(
self_rating == 1 ~ "Unacceptable",
self_rating == 2 ~ "Needs improvement",
self_rating == 3 ~ "Meets expectation",
self_rating == 4 ~ "Exceeds expectation",
self_rating == 5 ~ "Above and beyond",
TRUE ~ NA_character_
)) %>%
# Recode manager rating
mutate(cat_manager_rating = case_when(
manager_rating == 1 ~ "Unacceptable",
manager_rating == 2 ~ "Needs improvement",
manager_rating == 3 ~ "Meets expectation",
manager_rating == 4 ~ "Exceeds expectation",
manager_rating == 5 ~ "Above and beyond",
TRUE ~ NA_character_
))
datatable(hr_perf_dta)
## create bi_attrition
hr_perf_dta <- hr_perf_dta %>%
mutate(bi_attrition = if_else(attrition == "Yes", 1, 0))
datatable(hr_perf_dta)
## print the updated hr_perf_dta using datatable function
datatable(hr_perf_dta)
5 Exploratory data analysis
5.1 Descriptive statistics of employee attrition
Select the variables attrition, job_role, department, age, salary, job_satisfaction, and work_life_balance. Save as attrition_key_var_dta.
Compute and plot the attrition rate across job_role, department, age, salary, job_satisfaction, and work_life_balance. To compute the attrition rate, group the dataset by job role. Afterward, you can use the count function to get the frequency of attrition for each job role and then divide it by the total number of observations. Save the computation as pct_attrition. Do not forget to ungroup before storing the output. Store the output as attrition_rate_job_role.
The plot for the attrition rate across job_role has been done for you! Study each line of code. You have the freedom to customize your plot accordingly. Show your creativity!
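The count-then-divide recipe described above can be sketched on a small made-up table. The tibble toy and its values are hypothetical; only the column names job_role and attrition mirror the project data.

```r
library(dplyr)

# Hypothetical toy data: 4 analysts (1 left) and 2 engineers (both left)
toy <- tibble(
  job_role  = c("Analyst", "Analyst", "Analyst", "Analyst", "Engineer", "Engineer"),
  attrition = c("Yes", "No", "No", "No", "Yes", "Yes")
)

rate_by_role <- toy %>%
  count(job_role, attrition, name = "n_employees") %>%           # frequency per role and status
  group_by(job_role) %>%
  mutate(pct_attrition = n_employees / sum(n_employees) * 100) %>% # share within each role
  ungroup() %>%                                                   # ungroup before storing
  filter(attrition == "Yes")                                      # keep the attrition share only

rate_by_role
```

A design note: with this approach a role in which nobody left has no "Yes" row and drops out of the result, whereas summarising with mean(attrition == "Yes") keeps such roles with a rate of 0. That is one reason the solution code below uses the mean-based form.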
## selecting attrition key variables and save as `attrition_key_var_dta`
attrition_key_var_dta <- hr_perf_dta %>%
select(attrition, job_role, department, age, salary, work_life_balance,job_satisfaction)
## compute the attrition rate across job_role and save as attrition_rate_job_role
attrition_rate_job_role <- hr_perf_dta %>%
group_by(job_role) %>%
summarize(pct_attrition = mean(attrition == "Yes", na.rm = TRUE) * 100) %>%
ungroup()
## print attrition_rate_job_role
datatable(attrition_rate_job_role)
## compute the attrition rate across department and save as attrition_rate_department
attrition_rate_department <- hr_perf_dta %>%
group_by(department) %>%
summarise(
total_employees = n(),
total_attrition = sum(attrition == "Yes", na.rm = TRUE)
) %>%
mutate(pct_attrition = total_attrition / total_employees * 100) %>%
ungroup()
## print attrition_rate_department
datatable(attrition_rate_department)
## compute the attrition rate across age and save as attrition_rate_age
attrition_rate_age <- attrition_key_var_dta %>%
mutate(age_group = cut(age, breaks = seq(20, 60, by = 5), right = FALSE, include.lowest = TRUE)) %>%
group_by(age_group) %>%
summarise(
Total_Employees = n(),
Total_Attrition = sum(attrition == "Yes"),
pct_attrition = (Total_Attrition / Total_Employees) * 100
) %>%
ungroup()
# Print the attrition_rate_age
datatable(attrition_rate_age)
## compute the attrition rate across salary and save as attrition_rate_salary
attrition_rate_salary <- hr_perf_dta %>%
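# the top bin is open-ended (Inf) so it matches the "150k+" label and salaries above 200k are not set to NA
mutate(salary_range = cut(salary, breaks = c(0, 50000, 100000, 150000, Inf),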
labels = c("0-50k", "50k-100k", "100k-150k", "150k+"))) %>%
group_by(salary_range) %>%
summarise(
total_employees = n(),
pct_attrition = mean(bi_attrition == 1, na.rm = TRUE) * 100)
# Print the attrition_rate_salary
datatable(attrition_rate_salary)
# Compute the attrition rate across job_satisfaction
attrition_rate_job_satisfaction <- hr_perf_dta %>%
group_by(job_satisfaction) %>%
summarise(
total_employees = n(),
total_attrition = sum(attrition == "Yes", na.rm = TRUE)
) %>%
mutate(pct_attrition = total_attrition / total_employees * 100
) %>%
ungroup()
# Print the attrition_rate_job_satisfaction
datatable(attrition_rate_job_satisfaction)
## compute the attrition rate across work_life_balance and save as attrition_rate_work_life_balance
attrition_rate_work_life_balance <- hr_perf_dta %>%
group_by(work_life_balance) %>%
summarise(
total_employees = n(),
total_attrition = sum(attrition == "Yes", na.rm = TRUE)
) %>%
mutate(pct_attrition = total_attrition / total_employees * 100) %>%
ungroup()
# Print the attrition_rate_work_life_balance
datatable(attrition_rate_work_life_balance)
## Plot the attrition rate
## Plot attrition_rate_job_role
ggplot(attrition_rate_job_role, aes(x = reorder(job_role, -pct_attrition), y = pct_attrition)) +
geom_bar(stat = "identity", fill = "midnightblue") +
labs(title = "Attrition Rate by Job Role",
x = "Job Role",
y = "Attrition Rate (%)") +
theme_minimal() +
coord_flip()
## Plot attrition_rate_department
ggplot(attrition_rate_department, aes(y = reorder(department, -pct_attrition), x = pct_attrition)) +
geom_bar(stat = "identity", fill = "midnightblue") +
labs(title = "Attrition Rate by Department",
y = "Department",
x = "Attrition Rate (%)") +
theme_minimal() +
coord_flip()
## Plot attrition_rate_age
ggplot(attrition_rate_age, aes(y = reorder(age_group, -pct_attrition), x = pct_attrition)) +
geom_bar(stat = "identity", fill = "midnightblue") +
labs(title = "Attrition Rate by Age",
y = "Age",
x = "Attrition Rate") +
theme_minimal() +
coord_flip()
## plot attrition_rate_salary
ggplot(attrition_rate_salary, aes(x = salary_range, y = pct_attrition)) +
geom_bar(stat = "identity", fill = "midnightblue", color = "lightblue") +
geom_text(aes(label = round(pct_attrition, 1)), vjust = -0.5) +
labs(title = "Attrition Rate by Salary Range", x = "Salary Range", y = "Attrition Rate (%)") +
theme_minimal()
## plot attrition_rate_job_satisfaction
ggplot(attrition_rate_job_satisfaction, aes(y = reorder(job_satisfaction, -pct_attrition), x = pct_attrition)) +
geom_bar(stat = "identity", fill = "midnightblue") +
labs(title = "Attrition Rate by Job Satisfaction",
# labels follow the aesthetics: y is job_satisfaction, x is pct_attrition
y = "Job Satisfaction",
x = "Attrition Rate (%)") +
theme_minimal() +
coord_flip()
## plot attrition_rate_work_life_balance
ggplot(attrition_rate_work_life_balance, aes(y = reorder(work_life_balance, -pct_attrition), x = pct_attrition)) +
geom_bar(stat = "identity", fill = "midnightblue") +
labs(title = "Attrition Rate by Work Life Balance",
# labels follow the aesthetics: y is work_life_balance, x is pct_attrition
y = "Work Life Balance",
x = "Attrition Rate (%)") +
theme_minimal() +
coord_flip()
5.2 Identifying attrition key drivers using correlation analysis
Conduct a correlation analysis of key variables: bi_attrition, salary, years_at_company, job_satisfaction, manager_rating, and work_life_balance. Use the cor() function to run the correlation analysis. Remove missing values using na.omit() before running the correlation analysis. Save the output in hr_corr.
Use a correlation matrix or heatmap to visualize the relationship between these variables and attrition. You can use the GGally package and use the ggcorr function to visualize the correlation heatmap. You may explore the ggcorr documentation for more information.
Discuss which factors seem most correlated with attrition and what that suggests about why employees are leaving.
As the correlation matrix printed below shows, the strongest negative correlation with attrition is years at company (about -0.69), followed by salary (about -0.21). On the compensation side, an appropriate strategy for decreasing attrition could be improving compensation arrangements and making sure that employees do not feel their pay is unfair. Evaluating conditions not covered in this analysis, such as competition in the job market or opportunities for personal growth, may reveal more about employee turnover.
## conduct correlation of key variables.
key_variables_clean <- hr_perf_dta %>%
select(bi_attrition, salary, years_at_company, job_satisfaction, manager_rating, work_life_balance) %>%
na.omit()
print(head(key_variables_clean))
  bi_attrition salary years_at_company job_satisfaction manager_rating work_life_balance
1            0 102059               10                3              3                 4
2            0 102059               10                4              2                 2
3            0 102059               10                5              5                 4
4            0 102059               10                3              4                 3
5            0 102059               10                4              3                 3
6            0 102059               10                2              4                 3
hr_corr <- cor(key_variables_clean, use = "complete.obs")
ggcorr(key_variables_clean,
label = TRUE,
label_round = 3,
palette = "RdBu",
midpoint = 0,
label_size = 2,
low = "lightblue",
high = "blue",
name = "Corr"
) +
ggtitle("Correlation Heatmap: Key Variables and Attrition") +
theme_minimal()
## print hr_corr
print(hr_corr)
                  bi_attrition       salary years_at_company job_satisfaction
bi_attrition       1.000000000 -0.211181478    -0.6896527798     0.0132368129
salary            -0.211181478  1.000000000     0.2206442116     0.0053054850
years_at_company  -0.689652780  0.220644212     1.0000000000     0.0008700583
job_satisfaction   0.013236813  0.005305485     0.0008700583     1.0000000000
manager_rating    -0.007654429 -0.001596736     0.0178656879    -0.0158205481
work_life_balance  0.003428836 -0.001517145     0.0079339508     0.0417242942
                  manager_rating work_life_balance
bi_attrition        -0.007654429       0.003428836
salary              -0.001596736      -0.001517145
years_at_company     0.017865688       0.007933951
job_satisfaction    -0.015820548       0.041724294
manager_rating       1.000000000       0.007996938
work_life_balance    0.007996938       1.000000000
## install GGally package and use ggcorr function to visualize the correlation
key_variables_clean <- hr_perf_dta %>%
select(bi_attrition, salary, years_at_company, job_satisfaction, manager_rating, work_life_balance) %>%
na.omit()
cor_matrix <- cor(key_variables_clean, use = "complete.obs")
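# pass the precomputed matrix via cor_matrix; passing it as data would correlate the correlations
ggcorr(data = NULL, cor_matrix = cor_matrix,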
label = TRUE,
label_round = 2,
hjust = 0.75,
size = 3,
low = "royalblue",
mid = "white",
high = "midnightblue")
Provide your discussion here.
The variables most strongly correlated with attrition in this dataset are years at the company (about -0.69) and salary (about -0.21). Employees with longer tenure and higher salaries are less likely to leave. Other factors, such as job satisfaction, manager rating, and work-life balance, have correlations with attrition that are close to zero, so they seem to have a negligible linear relationship with whether employees choose to leave. This could imply that financial incentives and company loyalty play larger roles in employee retention within this particular organization than work environment or management factors. If the goal is to reduce attrition, focusing on increasing compensation and providing incentives for long-term retention could be effective strategies.
5.3 Predictive modeling for attrition
Create a logistic regression model to predict employee attrition using the following variables: salary, years_at_company, job_satisfaction, manager_rating, and work_life_balance. Save the model as hr_attrition_glm_model. Print the summary of the model using the summary function.
Install the sjPlot package and use the tab_model function to display the summary of the model. You may read the sjPlot documentation on how to customize your model summary.
Also, use the plot_model function to visualize the model coefficients. You may read the sjPlot documentation on how to customize your model visualization.
Discuss the results of the logistic regression model and what they suggest about the factors that contribute to employee attrition.
## run a logistic regression model to predict employee attrition
hr_attrition_glm_model <- glm(bi_attrition ~ salary + years_at_company + job_satisfaction +
manager_rating + work_life_balance,
data = hr_perf_dta, family = binomial)
## save the model as hr_attrition_glm_model
saveRDS(hr_attrition_glm_model, file = "hr_attrition_glm_model.rds")
## print the summary of the model using the summary function
summary(hr_attrition_glm_model)
Call:
glm(formula = bi_attrition ~ salary + years_at_company + job_satisfaction +
manager_rating + work_life_balance, family = binomial, data = hr_perf_dta)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.571e+00 2.173e-01 11.831 <2e-16 ***
salary -3.633e-06 4.086e-07 -8.893 <2e-16 ***
years_at_company -6.333e-01 1.476e-02 -42.919 <2e-16 ***
job_satisfaction 3.470e-02 3.186e-02 1.089 0.276
manager_rating 5.071e-03 3.810e-02 0.133 0.894
work_life_balance 2.587e-02 3.198e-02 0.809 0.419
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8574.5 on 6708 degrees of freedom
Residual deviance: 4781.6 on 6703 degrees of freedom
(190 observations deleted due to missingness)
AIC: 4793.6
Number of Fisher Scoring iterations: 5
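The estimates in the summary above are on the log-odds scale, so a short worked calculation may help when reading them. The sketch below copies the salary and years_at_company coefficients from that printed output and exponentiates them to get odds ratios. The rescaling of salary to steps of 10,000 is an illustrative choice, not part of the original analysis.

```r
# Coefficients copied from the printed summary() output (log-odds per unit)
b_salary <- -3.633e-06   # per 1 unit of salary
b_years  <- -6.333e-01   # per 1 year at the company

# Odds ratio = exp(coefficient * size of the change in the predictor)
or_salary_per_unit <- exp(b_salary)
or_salary_per_10k  <- exp(b_salary * 10000)  # illustrative rescaling
or_per_year        <- exp(b_years)

round(c(salary_per_unit = or_salary_per_unit,
        salary_per_10k  = or_salary_per_10k,
        years_at_company = or_per_year), 2)
# salary_per_unit 1.00, salary_per_10k 0.96, years_at_company 0.53
```

This is why a model table can show an odds ratio of 1.00 for salary even though the coefficient is statistically significant: the effect per single unit of salary is tiny, while a 10,000 increase corresponds to roughly 4% lower odds of leaving under this model.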
## install sjPlot package and use tab_model function to display the summary of the model
hr_attrition_glm_model <- glm(bi_attrition ~ salary + years_at_company + job_satisfaction +
manager_rating + work_life_balance,
data = hr_perf_dta,
family = binomial)
tab_model(hr_attrition_glm_model)
| Predictors | Odds Ratios (bi attrition) | CI | p |
| --- | --- | --- | --- |
| (Intercept) | 13.08 | 8.56 – 20.07 | <0.001 |
| salary | 1.00 | 1.00 – 1.00 | <0.001 |
| years at company | 0.53 | 0.52 – 0.55 | <0.001 |
| job satisfaction | 1.04 | 0.97 – 1.10 | 0.276 |
| manager rating | 1.01 | 0.93 – 1.08 | 0.894 |
| work life balance | 1.03 | 0.96 – 1.09 | 0.419 |
| Observations | 6709 | | |
| R² Tjur | 0.502 | | |
## use plot_model function to visualize the model coefficients
hr_attrition_glm_model <- glm(bi_attrition ~ salary + years_at_company + job_satisfaction +
manager_rating + work_life_balance,
data = hr_perf_dta,
family = binomial)
plot_model(hr_attrition_glm_model, show.values = TRUE, value.offset = .3)
Provide your discussion here.
From the tab_model results, the number of years an employee has spent at the company is the most important predictor of attrition: the odds ratio of 0.53 means each additional year is associated with roughly 47% lower odds of leaving. Salary is also statistically significant (p < 0.001), although its odds ratio displays as 1.00 because it is expressed per single unit of salary. Job satisfaction, manager rating, and work-life balance are not significant (p = 0.276, 0.894, and 0.419, respectively). From the plot_model results, years at company is likewise the factor most strongly associated with leaving the company. This could be because individuals who have been with the company longer are more likely to have developed strong relationships with their colleagues and may have a greater sense of commitment to the company.
5.4 Analysis of compensation and turnover
Compare the average monthly income of employees who left the company (bi_attrition = 1) and those who stayed (bi_attrition = 0). Use the t.test function to conduct a t-test and determine if there is a significant difference in average monthly income between the two groups. Save the results in a variable called attrition_ttest_results.
Install the report package and use the report function to generate a report of the t-test results.
Install the ggstatsplot package and use the ggbetweenstats function to visualize the distribution of monthly income for employees who left and those who stayed. Make sure to map the bi_attrition variable to the x argument and the salary variable to the y argument.
Visualize the salary variable for employees who left and those who stayed using geom_histogram with geom_freqpoly. Make sure to facet the plot by the bi_attrition variable and apply alpha on the histogram plot.
Provide recommendations on whether revising compensation policies could be an effective retention strategy.
## compare the average monthly income of employees who left and those who stayed
attrition_ttest_results <- t.test(salary ~ bi_attrition, data = hr_perf_dta)
## print the results of the t-test
print(attrition_ttest_results)
Welch Two Sample t-test
data: salary by bi_attrition
t = 18.869, df = 5524.2, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
38577.82 47523.18
sample estimates:
mean in group 0 mean in group 1
125007.26 81956.76
## install the report package and use the report function to generate a report of the t-test results
attrition_ttest_results <- t.test(salary ~ bi_attrition, data = hr_perf_dta)
report_ttest <- report(attrition_ttest_results)
# Print the report
print(report_ttest)
Effect sizes were labelled following Cohen's (1988) recommendations.
The Welch Two Sample t-test testing the difference of salary by bi_attrition
(mean in group 0 = 1.25e+05, mean in group 1 = 81956.76) suggests that the
effect is positive, statistically significant, and medium (difference =
43050.50, 95% CI [38577.82, 47523.18], t(5524.24) = 18.87, p < .001; Cohen's d
= 0.51, 95% CI [0.45, 0.56])
# install ggstatsplot package and use ggbetweenstats function to visualize the salary distribution by attrition status
ggbetweenstats(
data = hr_perf_dta,
x = bi_attrition,
y = salary,
xlab = "Attrition (0 = Stayed, 1 = Left)",
ylab = "Salary",
title = "Salary Distribution for Employees Who Stayed vs Left",
messages = FALSE,
mean.plotting = TRUE
) +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title.x = element_text(size = 12, margin = margin(t = 10)),
axis.title.y = element_text(size = 12, margin = margin(r = 10)),
axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 10),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10),
plot.caption = element_text(size = 10)
)
# create histogram and frequency polygon of salary for employees who left and those who stayed
ggplot(hr_perf_dta, aes(x = salary)) +
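# after_stat(density) and linewidth replace the deprecated ..density.. and size in recent ggplot2
geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "blue", alpha = 0.4) + # Histogram with transparency
geom_freqpoly(aes(y = after_stat(density)), color = "midnightblue", linewidth = 1.2, binwidth = 5000) + # Frequency polygon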
facet_wrap(~ bi_attrition, scales = "free") + # Facet by attrition status
labs(
title = "Salary Distribution and Frequency Polygon for Employees Who Stayed vs Left",
x = "Salary",
y = "Density"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12)
)
Provide your discussion here.
Based on the statistical analysis of employee salary data, revising compensation policies would likely be an effective retention strategy. The t-test results demonstrate a significant salary gap between employees who stayed and those who left (p < 2.2e-16), with retained employees earning notably more ($125,007 on average) than departing employees ($81,957 on average), a difference of $43,050. Organizations should prioritize implementing targeted salary adjustments for employees earning below $100,000, where data shows the highest attrition rates. While increased compensation alone won't guarantee retention, the statistically significant, medium-sized association between higher salaries and retention (Cohen's d = 0.51) suggests that a comprehensive compensation review, including structured salary bands and clear growth paths, should be a cornerstone of any retention strategy.
5.5 Employee satisfaction and performance analysis
Analyze the average performance ratings (both ManagerRating and SelfRating) of employees who left vs. those who stayed. Use the group_by and count functions to calculate the average performance ratings for each group.
Visualize the distribution of SelfRating for employees who left and those who stayed using a bar plot. Use the ggplot function to create the plot and map the SelfRating variable to the x argument and the bi_attrition variable to the fill argument.
Similarly, visualize the distribution of ManagerRating for employees who left and those who stayed using a bar plot. Make sure to map the ManagerRating variable to the x argument and the bi_attrition variable to the fill argument.
Create a boxplot of salary by job_satisfaction and bi_attrition to analyze the relationship between salary, job satisfaction, and attrition. Use the geom_boxplot function to create the plot and map the salary variable to the x argument, the job_satisfaction variable to the y argument, and the bi_attrition variable to the fill argument. You need to transform the job_satisfaction and bi_attrition variables into factors before creating the plot or within the ggplot function.
Discuss the results of the analysis and provide recommendations for HR interventions based on the findings.
# Analyze the average performance ratings (both ManagerRating and SelfRating) of employees who left vs. those who stayed.
result <- hr_perf_dta %>%
group_by(bi_attrition) %>% # 0 = stayed, 1 = left
summarize(
count_manager_rating = n(),
avg_manager_rating = mean(manager_rating, na.rm = TRUE),
count_self_rating = n(),
avg_self_rating = mean(self_rating, na.rm = TRUE)
)
# Print the result
print(result)
# A tibble: 2 × 5
bi_attrition count_manager_rating avg_manager_rating count_self_rating
<dbl> <int> <dbl> <int>
1 0 4638 3.48 4638
2 1 2261 3.46 2261
# ℹ 1 more variable: avg_self_rating <dbl>
# Visualize the distribution of SelfRating for employees who left and those who stayed using a bar plot.
self_rating_distribution <- hr_perf_dta %>%
group_by(bi_attrition, self_rating) %>%
summarize(count = n(), .groups = 'drop')
ggplot(self_rating_distribution, aes(x = factor(self_rating), y = count, fill = factor(bi_attrition))) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Self Rating", y = "Count", fill = "Attrition Status") +
scale_fill_manual(values = c("skyblue", "royalblue"), labels = c("Stayed (0)", "Left (1)")) +
theme_minimal() +
ggtitle("Distribution of Self Rating by Employee Attrition Status")
# Visualize the distribution of ManagerRating for employees who left and those who stayed using a bar plot.
manager_rating_distribution <- hr_perf_dta %>%
group_by(bi_attrition, manager_rating) %>%
summarize(count = n(), .groups = 'drop')
ggplot(manager_rating_distribution, aes(x = factor(manager_rating), y = count, fill = factor(bi_attrition))) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Manager Rating", y = "Count", fill = "Attrition Status") +
scale_fill_manual(values = c("skyblue", "royalblue"), labels = c("Stayed (0)", "Left (1)")) +
theme_minimal() +
ggtitle("Distribution of Manager Rating by Employee Attrition Status")
# create a boxplot of salary by job_satisfaction and bi_attrition to analyze the relationship between salary, job satisfaction, and attrition.
ggplot(hr_perf_dta, aes(x = factor(job_satisfaction), y = salary)) +
geom_boxplot() +
labs(x = "Job Satisfaction", y = "Salary") +
theme_minimal() +
facet_wrap(~ bi_attrition, labeller = labeller(bi_attrition = c("0" = "Stayed", "1" = "Left"))) +
ggtitle("Distribution of Salary by Job Satisfaction and Attrition Status")
Provide your discussion here.
The analysis of performance ratings, satisfaction, and attrition data reveals several interesting patterns. Most notably, there is minimal difference in both manager ratings (3.48 vs 3.46) and self-ratings (3.98 vs 3.99) between employees who stayed and those who left, suggesting that performance is not a primary driver of attrition. The distribution patterns show that while employees generally rate themselves highly (4-5 range), manager ratings follow a more normal distribution centered around 3-4, indicating a potential disconnect between self-perception and manager assessment. Furthermore, the salary and job satisfaction relationship shows clear compensation disparities between those who stayed and those who left at every satisfaction level, with lower salaries associated with higher attrition regardless of job satisfaction level.
Based on these findings, several HR interventions are recommended. First, the performance management system should be enhanced with more frequent discussions and clearer metrics to align self and manager perceptions. Second, the compensation structure needs review and adjustment to ensure competitive salaries at all satisfaction levels, with regular market-rate reviews and clear growth paths tied to performance. Third, career development opportunities, including structured progression frameworks and mentorship programs, should be expanded, since performance ratings suggest departing employees are equally capable. Finally, implementing non-monetary recognition systems and peer recognition programs could complement these initiatives by boosting engagement and retention. These interventions should be implemented as an integrated strategy rather than as isolated initiatives, with regular monitoring and adjustment based on effectiveness.
5.6 Work-life balance and retention strategies
At this point, you are already well aware of the dataset and the possible factors that contribute to employee attrition. Using your R skills, accomplish the following tasks:
- Analyze the distribution of WorkLifeBalance ratings for employees who left versus those who stayed.
work_life_balance_summary <- hr_perf_dta %>%
group_by(bi_attrition, work_life_balance) %>%
summarise(count = n(), .groups = "drop")
print(work_life_balance_summary)
# A tibble: 11 × 3
   bi_attrition work_life_balance count
          <dbl>             <int> <int>
 1            0                 1    84
 2            0                 2  1134
 3            0                 3  1090
 4            0                 4  1146
 5            0                 5   994
 6            0                NA   190
 7            1                 1    37
 8            1                 2   568
 9            1                 3   580
10            1                 4   560
11            1                 5   516
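Because the two groups differ in size, raw counts are hard to compare directly. The following is a minimal sketch, written in base R so it runs without the tidyverse, that converts the counts printed above into within-group shares. The object names `wlb_counts` and `wlb_share` are illustrative, and the 190 records with a missing rating are left out as an assumption.

```r
# Counts copied from the summary above; rows are attrition status,
# columns are work-life balance ratings 1-5. Missing ratings are excluded.
wlb_counts <- matrix(
  c(84, 1134, 1090, 1146, 994,
    37,  568,  580,  560, 516),
  nrow = 2, byrow = TRUE,
  dimnames = list(attrition = c("stayed", "left"), rating = as.character(1:5))
)

# Share of each rating within each attrition group (each row sums to 1)
wlb_share <- prop.table(wlb_counts, margin = 1)

# Show the shares as percentages, rounded to one decimal place
print(round(wlb_share * 100, 1))
```

With these counts the two rows differ by roughly one percentage point at most, which suggests the rating distribution is very similar for employees who stayed and those who left.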
- Use visualizations to show the differences.
ggplot(hr_perf_dta, aes(x = factor(work_life_balance), fill = factor(bi_attrition))) +
geom_bar(position = "dodge") +
labs(
title = "Distribution of Work-Life Balance for Employees Who Stayed vs Left",
x = "Work-Life Balance Rating",
y = "Count",
fill = "Attrition (0 = Stayed, 1 = Left)"
) +
theme_minimal() +
scale_fill_manual(values = c("lightblue", "blue"))
- Assess whether employees with poor work-life balance are more likely to leave.
# Compute attrition rate by work_life_balance rating
attrition_rate_wlb <- hr_perf_dta %>%
group_by(work_life_balance) %>%
summarise(
total_employees = n(),
total_attrition = sum(bi_attrition == 1),
attrition_rate = (total_attrition / total_employees) * 100
)
# Print the attrition rate summary
print(attrition_rate_wlb)
# A tibble: 6 × 4
  work_life_balance total_employees total_attrition attrition_rate
              <int>           <int>           <int>          <dbl>
1                 1             121              37           30.6
2                 2            1702             568           33.4
3                 3            1670             580           34.7
4                 4            1706             560           32.8
5                 5            1510             516           34.2
6                NA             190               0            0
# Visualize the attrition rate by work_life_balance rating
ggplot(attrition_rate_wlb, aes(x = factor(work_life_balance), y = attrition_rate)) +
geom_col(fill = "midnightblue") +
labs(
title = "Attrition Rate by Work-Life Balance Rating",
x = "Work-Life Balance Rating",
y = "Attrition Rate (%)"
) +
theme_minimal()
You are free to decide how to accomplish this task. Be creative and provide insights that will help HR develop effective retention strategies.
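One way to check more formally whether attrition depends on the work-life balance rating is a chi-squared test of independence. The sketch below is an assumption about how that check might be done, not part of the original analysis: it re-enters the counts from the summary tables above, drops the 190 records with a missing rating, and uses base R's `chisq.test()`. The object names `wlb_counts` and `wlb_test` are illustrative.

```r
# Counts taken from the summary tables above; rows are attrition status,
# columns are work-life balance ratings 1-5. Missing ratings are excluded.
wlb_counts <- matrix(
  c(84, 1134, 1090, 1146, 994,
    37,  568,  580,  560, 516),
  nrow = 2, byrow = TRUE,
  dimnames = list(attrition = c("stayed", "left"), rating = as.character(1:5))
)

# Pearson chi-squared test: is attrition status independent of the rating?
wlb_test <- chisq.test(wlb_counts)
print(wlb_test)

# Attrition rate (%) per rating, for comparison with the table above
print(round(wlb_counts["left", ] / colSums(wlb_counts) * 100, 1))
```

With these counts the p-value comes out well above 0.05, so the spread in attrition rates across ratings (30.6% to 34.7%) is consistent with chance variation. On this evidence, employees reporting poor work-life balance do not appear more likely to leave than others.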
5.7 Recommendations for HR interventions
Based on the analysis conducted, provide recommendations for HR interventions that could help reduce employee attrition and improve overall employee satisfaction and performance. You may use the following questions as a guide for your recommendations and discussion.
- What are the key factors contributing to employee attrition in the company?
  The analysis suggests that salary, years at the company, and age group are the primary drivers of attrition. Employees with lower salaries tend to leave more frequently, while those with more years at the company are more likely to stay. Younger employees, particularly those in the 20-30 age group, exhibit higher attrition rates. This indicates that inadequate compensation and lack of early-career engagement might contribute to higher turnover.
- Which factors are most strongly correlated with attrition?
  Salary and years at the company are the strongest predictors of attrition. Employees with lower salaries and fewer years at the company are more likely to leave. Other factors, such as job satisfaction and work-life balance, appear to have weaker correlations with attrition, indicating that financial incentives and company loyalty play a larger role than environmental or managerial factors in this organization.
- What strategies could be implemented to improve employee retention and satisfaction?
  To improve retention, HR should prioritize revising compensation policies to ensure competitive salaries, especially for employees earning under $100,000. Implementing structured career development programs and improving onboarding processes for new employees could engage them early on, reducing turnover. Additionally, offering flexible work arrangements to improve work-life balance and conducting regular salary reviews could help retain staff.
- How can HR leverage the insights from the analysis to develop effective retention strategies?
  HR can leverage these findings to focus retention efforts where needed most. Salary adjustments should target lower-paid roles and employees with high turnover, such as younger staff or those in high-attrition departments like sales. Tailored programs like mentorship and career progression could also help reduce turnover among employees with fewer years at the company. Monitoring employee satisfaction and addressing issues as they arise can prevent dissatisfaction from escalating.
- What are the potential benefits of implementing these strategies for the company?
  The company could experience significant cost savings from reduced turnover, which will also alleviate the need for continuous recruitment and training. By enhancing employee satisfaction through better compensation and career development, engagement levels and productivity will likely increase. Retaining high-performing employees ensures long-term stability and strengthens the company’s ability to achieve strategic goals.